Getting Started with Batch Job Scheduling (Batch computing Part II) (COMPLECS)

Remote event

High-performance computing (HPC) systems are specialized resources shared by many researchers across all domains of science, engineering, and beyond. To distribute these advanced computing resources in an efficient, fair, and organized way, most computational workloads on these systems are executed as batch jobs: prescripted sets of commands that run on a subset of an HPC system’s compute resources for a given amount of time. Researchers submit these batch jobs as scripts to a batch job scheduler, the software that controls and tracks where and when each submitted job will eventually run. However, if this is your first time using an HPC system and interacting with a batch job scheduler like Slurm, writing and submitting your first batch job scripts may be somewhat intimidating due to the inherent complexity of these systems. Moreover, schedulers can be configured in many different ways and often have features and options that vary from system to system, which you will also need to consider when writing and submitting your batch jobs.

In this second part of our series on Batch Computing, we will introduce you to the concept of a distributed batch job scheduler (what it is, why it exists, and how it works), using the Slurm Workload Manager as our reference implementation and testbed. You will then learn how to write your first job script and submit it to an HPC system running Slurm as its scheduler. We will also discuss best practices for structuring your batch job scripts, teach you how to leverage Slurm environment variables, and provide tips on how to request resources from the scheduler to get your work done faster.

To complete the exercises covered in Part II, you will need access to an HPC system running the Slurm Workload Manager as its batch job scheduler.
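
As a rough preview of the kind of job script discussed in this session, here is a minimal sketch. The `#SBATCH` options shown (`--job-name`, `--nodes`, `--ntasks`, `--cpus-per-task`, `--time`, `--output`) are standard Slurm directives, but the values, the job name `hello-slurm`, and the helper function `describe_job` are illustrative assumptions, not material from the session itself. Most systems also require site-specific options such as `--partition` and `--account`, which are omitted here; check your system's user guide.

```shell
#!/bin/bash
# Resource requests: Slurm reads these directives at submission time;
# the shell itself treats them as ordinary comments.
#SBATCH --job-name=hello-slurm        # hypothetical job name
#SBATCH --nodes=1                     # one compute node
#SBATCH --ntasks=1                    # one task (process)
#SBATCH --cpus-per-task=1             # one CPU core for that task
#SBATCH --time=00:05:00               # wall-clock limit (HH:MM:SS)
#SBATCH --output=hello-slurm.%j.out   # %j expands to the job ID

# describe_job (hypothetical helper): prints a few environment variables
# that Slurm sets inside a running job. The fallback values only apply
# when the script is run outside of Slurm.
describe_job() {
    echo "job id: ${SLURM_JOB_ID:-none}"
    echo "node list: ${SLURM_JOB_NODELIST:-none}"
    echo "cpus per task: ${SLURM_CPUS_PER_TASK:-1}"
    echo "submit dir: ${SLURM_SUBMIT_DIR:-$PWD}"
}

describe_job
```

Saved under a name of your choosing (for example `hello-slurm.sh`, an assumed filename), a script like this would typically be submitted with `sbatch hello-slurm.sh` and monitored with `squeue -u $USER`.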

Instructor

Marty Kandes, Ph.D.

Computational & Data Science Research Specialist, SDSC

Marty Kandes is a Senior Computational and Data Science Research Specialist at the San Diego Supercomputer Center (SDSC). As part of the High-Performance Computing (HPC) User Services Group within the Data-Enabled Scientific Computing Division, he provides technical user support and services to the national research community leveraging the Advanced Cyberinfrastructure (CI) and HPC resources designed, built and operated by SDSC on behalf of the U.S. National Science Foundation (NSF). Marty is also a member of the National Artificial Intelligence (AI) Research Institute for Intelligent CI with Computational Learning in the Environment (ICICLE). His current research interests include problems in distributed AI inference over wireless networks, data privacy in natural language processing, and secure interactive computing. He also contributes to many of the education, outreach, and training initiatives at SDSC, including serving as a Co-PI for the COMPrehensive Learning for end-users to Effectively utilize CyberinfraStructure (COMPLECS) CyberTraining program and as a mentor for the Research Experience for High School Students (REHS) program. Marty received his Ph.D. in Computational Science from the Computational Science Research Center (CSRC) at San Diego State University (SDSU), where he studied quantum systems in rotating frames of reference through the use of numerical simulations. He also holds an M.S. in Physics from SDSU and dual B.S. degrees in Applied Mathematics and Physics from the University of Michigan, Ann Arbor.